RidgeRun NVIDIA PVA Development Algorithms

From RidgeRun Developer Wiki








PVA Algorithms from LibPVA

RidgeRun has implemented the following image processing algorithms on the PVA. These are foundational for image signal processing (ISP) pipelines and optimized for high efficiency.


Info
Currently, these algorithms are just for performance evaluation purposes and are not intended to be used in production. Stay tuned for more!


Get access to the FREE evaluation version of the PVA sample binaries here:


All measurements were taken under the following conditions:

  • Platform: Jetson AGX Orin 32GB
  • OS: JetPack 6.2
  • Power Profile: MAXN power mode + Jetson Clocks
  • CPU: All measurements use aggressive compiler optimization flags and OpenMP. Introducing NEON might halve the execution times.
  • PVA: All measurements use a single PVA (with two VPU cores)
  • Power Measurements: using jetson-stats (a tool based on tegrastats) with a VDD_CPU_CV power meter probe.

The profiling details are:

  • Execution time CPU (ms): using one ARM core execution
  • Execution time PVA (ms): using one PVA / two DSP slices (VPU) execution
  • Power Consumption CPU only (W): using a number of cores such that the execution time of the CPU is nearly the same as the PVA (iso-perf).
  • Power Consumption PVA only (W): using a single PVA (with two VPU cores)


Info
Each PVA has two VPUs, and all PVA measurements use both of them.


Info
The Performance Ratio (PVA/CPU) indicates how many CPU cores are needed to match the performance of the PVA (iso-perf). For example, if the performance ratio is 6.5x, the CPU requires ~7 cores to match the performance of the PVA.
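The relationship between the performance ratio and the iso-perf core count can be sketched as follows. This is a minimal illustration, not the tooling used for the benchmarks: the function names are hypothetical, and rounding up assumes ideal linear scaling across cores, which real workloads only approximate. The benchmark sections pick core counts that nearly match the PVA latency, so they may differ slightly from this estimate.

```python
import math


def performance_ratio(cpu_ms: float, pva_ms: float) -> float:
    """Speedup of the PVA over a single CPU core (CPU time / PVA time)."""
    return cpu_ms / pva_ms


def iso_perf_cores(ratio: float) -> int:
    """CPU cores needed to match the PVA, assuming ideal linear scaling."""
    return math.ceil(ratio)


# Hypothetical timings chosen to reproduce the 6.5x example above.
ratio = performance_ratio(cpu_ms=6.5, pva_ms=1.0)
print(f"{ratio:.1f}x -> {iso_perf_cores(ratio)} cores")  # 6.5x -> 7 cores
```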


Info
The algorithms that run on the PVA are inspired by the NVIDIA PVA Solutions. More information about it here.


The power consumption has been acquired at the entire platform level using the jetson-stats Python library.


Bit Shifting (Debayering Resolution Downscaling)

This technique reduces the pixel resolution (bit depth) through controlled bit manipulation during debayering. It is useful for optimizing bandwidth or matching downstream resolution requirements.

Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized implementation of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. A shift of 10 bits was used for the benchmarks. Performance measurements can also be observed in the attached graph.

Bit Shifting execution time and power consumption. The execution time shows the runtime to complete a transformation on a 16-bit image to an 8-bit image using one CPU core and one PVA (2x VPUs). The power consumption is iso-perf measurements, where the CPU uses six cores to match the PVA latency.
| Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) | Power consumption CPU only (W) | Power consumption PVA (W) | Power Ratio (CPU/PVA) |
|---|---|---|---|---|---|---|
| 1280x720 | 0.309 | 0.04865 | 6.35x | 8.75 | 3.21 | 2.73x |
| 1920x1080 | 0.675 | 0.10678 | 6.32x | 9.14 | 3.27 | 2.80x |
| 3840x2160 | 2.51 | 0.4061 | 6.18x | 9.54 | 3.24 | 2.94x |
Fig 1. Bit shifting execution time. The execution time shows the runtime to complete a transformation on a 16-bit image to an 8-bit image using one CPU core and one PVA (2x VPUs).


This downscales a single-channel image from 16-bit to 8-bit. Six ARM cores are required to match the latency of the PVA.
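The 16-bit to 8-bit reduction described above can be sketched on the CPU as a right shift per sample. This is an illustrative reference only, not the benchmarked implementation: the function name is hypothetical, and saturating to 255 is an assumption for shifts smaller than 8 bits, since the source does not say how overflow is handled.

```python
def shift_to_8bit(samples, shift):
    """Right-shift each 16-bit sample, then saturate to the 8-bit range.

    Saturation is an assumption; it only matters when shift < 8.
    """
    return [min(sample >> shift, 255) for sample in samples]


row = [0, 1024, 40000, 65535]
print(shift_to_8bit(row, 10))  # [0, 1, 39, 63]
```

With the 10-bit shift used in the benchmarks, a full-scale 16-bit sample maps to 63, so only six significant bits remain in the output.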

Radial Lens Shading Correction

Corrects vignetting or intensity falloff from the center to the edges of an image caused by lens characteristics. It’s implemented using radial correction maps that are efficiently processed on the PVA.

Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized implementation of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. Performance measurements can also be observed in the attached graph.

Radial Lens Shading correction execution time and power consumption. The execution time shows the runtime to process an RGB24 image using an 8-bit fixed-point correction. The execution time measurement uses one CPU core and one PVA (2x VPU). The power consumption is iso-perf measurements, where the CPU uses ten cores to match the PVA latency.
| Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) | Power consumption CPU only (W) | Power consumption CPU and PVA (W) | Power Ratio (CPU/PVA) |
|---|---|---|---|---|---|---|
| 1280x720 | 1.56 | 0.145 | 10.75x | 8.4 | 3.69 | 2.28x |
| 1920x1080 | 3.5 | 0.330 | 10.6x | 7.6 | 3.61 | 2.11x |
| 3840x2160 | 13.8 | 1.402 | 9.84x | 7.2 | 3.57 | 2.02x |
Fig 2. Radial Lens Shading correction execution time. The execution time shows the runtime to process an RGB24 image using an 8-bit fixed-point correction. The execution time measurement uses one CPU core and one PVA (2x VPU).

The measurements were done with:

  • 8-bit Fixed-point correction maps (including channels)
  • RGB images (RGB24) - 8-bit per channel
  • The CPU requires ten ARM cores to match the PVA latency.
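A radial correction with an 8-bit fixed-point gain map could look like the sketch below. Everything specific here is an assumption made for illustration: the quadratic falloff model, the strength parameter `k`, the 2.6 fixed-point split (6 fractional bits), and the function names. The source does not describe the actual map format, and the sketch processes a single plane, whereas the benchmark uses RGB24.

```python
def build_gain_map(width, height, k, frac_bits=6):
    """Build per-pixel gains 1 + k * (r / r_max)^2, quantized to 8 bits."""
    cx, cy = (width - 1) / 2, (height - 1) / 2
    r2_max = max(cx * cx + cy * cy, 1e-9)  # avoid division by zero on 1x1
    scale = 1 << frac_bits
    gains = []
    for y in range(height):
        for x in range(width):
            r2 = (x - cx) ** 2 + (y - cy) ** 2
            gain = 1.0 + k * r2 / r2_max
            gains.append(min(255, round(gain * scale)))
    return gains


def apply_gain(plane, gains, frac_bits=6):
    """Multiply each 8-bit sample by its fixed-point gain, round, saturate."""
    half = 1 << (frac_bits - 1)
    return [min(255, (p * g + half) >> frac_bits) for p, g in zip(plane, gains)]


gains = build_gain_map(3, 3, k=1.0)
print(apply_gain([100] * 9, gains))
# [200, 150, 200, 150, 100, 150, 200, 150, 200]
```

The centre pixel keeps its value while the corners are boosted the most, which is the behaviour expected from a vignetting correction.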

Colour Space Conversion (RGBA-Gray)

Transforms image data from one color space to another (e.g., RGB to YUV). It’s essential for encoding, display pipelines, and transmission where non-RGB formats are used.

These implementations showcase how RidgeRun leverages the PVA to create real-time, power-efficient vision pipelines suitable for embedded systems under tight performance constraints.

Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized version of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. In the example measurements, an RGBA to Grayscale conversion was performed. Performance measurements can also be observed in the attached graph.

RGBA to Grayscale conversion execution time and power consumption. The execution time shows the runtime to process an RGB32 image to a GRAY8 image. The execution time measurement uses one CPU core and one PVA (2x VPUs). The power consumption is iso-perf measurements, where the CPU uses twelve cores to match the PVA latency.
| Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) | Power consumption CPU only (W) | Power consumption PVA (W) | Power Ratio (CPU/PVA) |
|---|---|---|---|---|---|---|
| 1280x720 | 1.36 | 0.085 | 16.0x | 10.35 | 3.97 | 2.6x |
| 1920x1080 | 3.05 | 0.195 | 15.6x | 10.74 | 3.84 | 2.8x |
| 3840x2160 | 12.1 | 0.746 | 16.2x | 10.74 | 3.61 | 2.98x |
Fig 3. RGBA to Grayscale conversion execution time. The execution time shows the runtime to process an RGB32 image to a GRAY8 image. The execution time measurement uses one CPU core and one PVA (2x VPUs).

The images involved:

  • Input: RGBA32 (8-bit per channel, four channels)
  • Output: Gray8 (8-bit single channel)
  • The CPU requires twelve cores to closely match the PVA's latency.
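For reference, an RGBA32 to Gray8 conversion can be sketched with integer arithmetic as below. The BT.601 luma weights scaled to 8 fractional bits (77, 150, 29) are an assumption, since the source does not state which coefficients were benchmarked, and the function name is hypothetical.

```python
def rgba_to_gray(rgba: bytes) -> bytes:
    """Convert packed RGBA32 to Gray8 using integer BT.601 luma weights.

    The weights sum to 256, so the result always fits in 8 bits.
    The alpha channel is ignored.
    """
    out = bytearray()
    for i in range(0, len(rgba), 4):
        r, g, b = rgba[i], rgba[i + 1], rgba[i + 2]
        out.append((77 * r + 150 * g + 29 * b + 128) >> 8)
    return bytes(out)


pixels = bytes([255, 255, 255, 255,   # white
                0, 0, 0, 255,         # black
                255, 0, 0, 255])      # red
print(list(rgba_to_gray(pixels)))  # [255, 0, 77]
```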

Colour Space Conversion (YUY2-RGBA)

Transforms image data from one color space to another (e.g., YUY2 to RGBA). It’s essential for encoding, display pipelines, and transmission where non-RGB formats are used.

Average performance measurements are shown in the following tables for the most common resolutions. Measurements are shown for an optimized version of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. In the example measurements, YUY2 to RGBA and RGBA to YUY2 conversions were performed. Performance measurements can also be observed in the attached graphs.

YUY2 to RGBA conversion execution time and power consumption. The execution time shows the runtime to convert a YUY2 image to an RGBA32 image. The execution time measurement uses one CPU core and one PVA (2x VPU). The power consumption is measured at iso-perf, where the CPU uses twelve cores to nearly match the PVA latency.
| Resolution | Execution time CPU (ms) | Execution time VIC (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) | Performance Ratio (VIC/PVA) | Power consumption CPU only (W) | Power consumption VIC only (W) | Power consumption PVA (W) | Power Ratio (CPU/PVA) | Power Ratio (VIC/PVA) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1280x720 | 12.7 | 2.0 | 0.187 | 67.91x | 10.7x | 11.12 | 2.385 | 3.578 | 3.1x | 0.67x |
| 1920x1080 | 28.6 | 5.2 | 0.423 | 67.6x | 12.3x | 11.12 | 2.385 | 3.578 | 3.1x | 0.67x |
| 3840x2160 | 113.3 | 16.3 | 1.663 | 68.1x | 9.8x | 11.12 | 2.385 | 3.578 | 3.1x | 0.67x |
Fig 4. YUY2 to RGBA32 conversion execution time. The execution time shows the runtime to convert a YUY2 image to an RGBA32 image. The execution time measurement uses one CPU core and one PVA (2x VPU).

RGBA to YUY2 conversion execution time and power consumption. The execution time shows the runtime to process an RGBA32 image to a YUY2 image. The execution time measurement uses one CPU core and one PVA (2x VPU). The power consumption is iso-perf measurements, where the CPU uses twelve cores to nearly match the PVA latency.
| Resolution | Execution time CPU (ms) | Execution time VIC (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) | Performance Ratio (VIC/PVA) |
|---|---|---|---|---|---|
| 1280x720 | 4.82 | 1.3 | 0.157 | 30.9x | 8.28x |
| 1920x1080 | 10.7 | 2.8 | 0.334 | 32.0x | 8.38x |
| 3840x2160 | 42.7 | 10.5 | 1.271 | 33.6x | 8.26x |
Fig 4. RGBA32 to YUY2 conversion execution time. The execution time shows the runtime to convert an RGBA32 image to a YUY2 image. The execution time measurement uses one CPU core and one PVA (2x VPU).

The images involved:

  • Input/Output: YUY2 (YUV 4:2:2 interleaved)
  • Output/Input: RGBA32 (8-bit per channel, four channels)
  • The CPU requires twelve cores to match the PVA's latency. The CPU implementation is GStreamer's videoconvert element, chosen for its popularity and relevance in ISP.
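The YUY2 layout and its conversion to RGBA32 can be sketched as follows. YUY2 packs two pixels into four bytes (Y0, U, Y1, V), so both pixels share one chroma pair. The integer BT.601 limited-range coefficients used here are a commonly published approximation and an assumption on our part; neither the PVA kernel nor videoconvert is confirmed to use exactly this matrix. Function names are hypothetical.

```python
def clip8(value: int) -> int:
    """Clamp an integer to the 0..255 range."""
    return max(0, min(255, value))


def yuv_to_rgb(y: int, u: int, v: int):
    """Integer BT.601 limited-range YUV to RGB (assumed coefficients)."""
    c, d, e = y - 16, u - 128, v - 128
    r = clip8((298 * c + 409 * e + 128) >> 8)
    g = clip8((298 * c - 100 * d - 208 * e + 128) >> 8)
    b = clip8((298 * c + 516 * d + 128) >> 8)
    return r, g, b


def yuy2_to_rgba(yuy2: bytes) -> bytes:
    """Expand each Y0 U Y1 V group into two opaque RGBA pixels."""
    out = bytearray()
    for i in range(0, len(yuy2), 4):
        y0, u, y1, v = yuy2[i:i + 4]
        for y in (y0, y1):
            out.extend((*yuv_to_rgb(y, u, v), 255))
    return bytes(out)


# One macropixel: a black pixel followed by a white pixel.
print(list(yuy2_to_rgba(bytes([16, 128, 235, 128]))))
# [0, 0, 0, 255, 255, 255, 255, 255]
```

Because the input is 2 bytes per pixel and the output 4 bytes per pixel, this direction doubles the memory traffic, which is worth keeping in mind when comparing it against the RGBA to YUY2 table.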

2D Filtering (Convolution)

Applies a 2D filter using a non-separable 5x5 kernel. It can be used for general image filtering as well as for showcasing general 2D convolution performance.

Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized version of the algorithm, and all results are in milliseconds. Additionally, better performance can be achieved with further optimization as shown in NVIDIA's PVA Solutions implementation of the convolution. Performance measurements can also be observed in the attached graph.

The CPU implementation is based on cv::filter2D with a 5x5 non-separable kernel.

2D Filter convolution execution time. The execution time shows the runtime to process an 8-bit gray image. The execution time measurement uses one CPU core and one PVA (2x VPUs).
| Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) |
|---|---|---|---|
| 1280x720 | 6.84 | 0.179 | 38.2x |
| 1920x1080 | 15.37 | 0.378 | 40.66x |
| 3840x2160 | 61.64 | 1.503 | 41.1x |
Fig 5. 5x5 non-separable convolution 2D filter execution time. The execution time shows the runtime to process a GRAY8 image. The execution time measurement uses one CPU core and one PVA (2x VPUs).


The images involved:

  • Input: Gray8 (8-bit single channel)
  • Output: Gray8 (8-bit single channel)
  • The CPU code uses OpenCV functions (cv::filter2D).
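A non-separable 5x5 filter on a Gray8 plane can be sketched in plain Python as below. This is not the benchmarked code: the function name, the integer kernel with a divisor, and the clamp-to-edge border handling are assumptions made to keep the example short. Note that cv::filter2D uses a reflected border by default, so border pixels would differ from this sketch.

```python
def filter2d_5x5(image, width, height, kernel, divisor=1):
    """Correlate a row-major 8-bit plane with a 5x5 integer kernel.

    Borders are handled by clamping coordinates to the image edge
    (an assumption). Results are rounded, then saturated to 0..255.
    """
    out = []
    for y in range(height):
        for x in range(width):
            acc = 0
            for ky in range(5):
                sy = min(max(y + ky - 2, 0), height - 1)
                for kx in range(5):
                    sx = min(max(x + kx - 2, 0), width - 1)
                    acc += kernel[ky][kx] * image[sy * width + sx]
            out.append(max(0, min(255, (acc + divisor // 2) // divisor)))
    return out


box = [[1] * 5 for _ in range(5)]  # 5x5 box blur, normalized by 25
print(filter2d_5x5([10] * 12, 4, 3, box, divisor=25))  # twelve 10s
```

Each output pixel needs 25 multiply-accumulate operations that cannot be split into two 1D passes, which is why this workload is a useful indicator of general 2D convolution throughput.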

Black Level Correction

Corrects sensor black level offsets by normalizing the pixel intensity baseline, ensuring true blacks and accurate dark-region details. It’s implemented using selectable offset adjustments that are efficiently processed on the PVA.

Average performance measurements are shown in the following table for the most common resolutions. Measurements are shown for an optimized implementation of the algorithm, and all results are in milliseconds. Additionally, power consumption measurements are shown in watts. Performance measurements can also be observed in the attached graph.

Black Level correction execution time and power consumption. The execution time shows the runtime to process an RGB24 image using a 4-bit fixed-point correction. The execution time measurement uses one CPU core and one PVA (2x VPU). The power consumption is iso-perf measurements, where the CPU uses twelve threads.
| Resolution | Execution time CPU (ms) | Execution time PVA (ms) | Performance Ratio (PVA/CPU) | Power consumption CPU only (W) | Power consumption CPU and PVA (W) | Power Ratio (CPU/PVA) |
|---|---|---|---|---|---|---|
| 1280x720 | 3.2 | 0.116 | 27.58x | 10.4 | 3.21 | 3.24x |
| 1920x1080 | 7.2 | 0.263 | 27.37x | 10.4 | 3.20 | 3.25x |
| 3840x2160 | 28.6 | 1.059 | 27.0x | 10.8 | 3.21 | 3.36x |
Fig 6. Black Level correction execution time. The execution time shows the runtime to process an RGB24 image using a 4-bit fixed-point correction. The execution time measurement uses one CPU core and one PVA (2x VPU).

The measurements were done with:

  • 4-bit Fixed-point correction maps (including channels)
  • RGB images (RGB24) - 8-bit per channel
  • The CPU requires twelve ARM cores to match the PVA latency.
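The core of a black level correction is a saturating subtraction, sketched below for packed RGB24 data. The per-channel integer offsets and the function name are assumptions for illustration. The source mentions 4-bit fixed-point correction maps but does not describe their format, so that part is not modelled here, and no gain rescaling is applied after the subtraction.

```python
def black_level_correct(rgb: bytes, offsets=(16, 16, 16)) -> bytes:
    """Subtract a per-channel black level from packed RGB24, clamping at 0."""
    out = bytearray(len(rgb))
    for i, sample in enumerate(rgb):
        out[i] = max(sample - offsets[i % 3], 0)
    return bytes(out)


pixels = bytes([10, 16, 255, 20, 0, 100])  # two RGB24 pixels
print(list(black_level_correct(pixels)))  # [0, 0, 239, 4, 0, 84]
```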

Access to PVA Solutions

Access to the NVIDIA PVA Solutions has allowed us to push performance boundaries significantly, guided by the insights the PVA architects provide in each example. For instance, in the RGBA-to-Greyscale colour space conversion, we reduced the execution time at 3840x2160 from 8 ms to just 0.746 ms, a speedup of roughly 10x. This improvement was achieved by leveraging the diverse programming techniques available on the PVA, as demonstrated through the Solutions. A similar gain was observed in the 2D Filter, where execution time was reduced by half, delivering a 2x speedup.

Final Remarks

The PVA not only outperforms the CPU in performance per watt, but also delivers faster execution—particularly when leveraging its inherently parallel architecture. By combining speed and energy efficiency, it provides a strong advantage in embedded vision systems. Here are some remarks:

  • Energy efficiency (performance per watt)

The PVA delivers significantly higher efficiency compared to conventional CPUs. Thanks to its VLIW SIMD architecture, purpose-built for computer vision tasks, it achieves high throughput with low power consumption—an essential feature for embedded applications requiring autonomy, continuous operation, and low latency.

  • CPU and GPU offloading

The PVA offloads many pre- and post-processing tasks (such as filtering, remapping, pyramids, lens correction, etc.) from the CPU and GPU. This frees system resources for more demanding workloads like AI inference, improving overall system efficiency.

  • Tailored for real-time embedded vision

With its programmability by way of an optimizing C/C++ compiler, specialized fixed-function units such as the DLUT, low power profile, and deterministic execution, the PVA is an excellent fit for real-time, continuous vision applications in domains such as autonomous mobility, robotics, and surveillance, where both responsiveness and energy efficiency are critical.